I chose to do my final project on the Titanic dataset. I chose this particular dataset because of its popularity among young data scientists. It is one of the easiest datasets to begin with when learning to build a predictive model. I wanted to get familiar with it so that I can build such a model myself in the near future, and I also thought it would be a fun topic for the project.
My question is: Which variables are most important for indicating whether or not someone survived the Titanic disaster?
## # A tibble: 6 × 8
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 3 1 22 1 0 7.25 1
## 2 1 1 2 38 1 0 71.3 2
## 3 1 3 2 26 0 0 7.92 1
## 4 1 1 2 35 1 0 53.1 1
## 5 0 3 1 35 0 0 8.05 1
## 6 0 1 1 54 0 0 51.9 1
-Class & Fare Price: Class is categorical and is split into 3 categories (1, 2, 3). Fare is numeric. There is a moderate negative correlation (-0.55) between the two variables. This indicates that as the fare increases, the class number decreases (moving toward 1st class). The assumption that higher fares are associated with 1st class (1) could already be made, but it is nice to see that the data supports it as well.
-Age & Class: Low negative correlation (-0.37). This indicates that as age increases, the class number decreases (closer to 1st class). One possible explanation is that older passengers had more money than younger passengers.
-Class: ‘Class’ has a low negative correlation (-0.36) with ‘Survived’. Class gets worse (economically) as its number increases (1st class is 1, 2nd is 2, and 3rd is 3), so wealthier people are more likely to be in first class. ‘Survived’ indicates whether the passenger died (0) or survived (1), meaning the higher the value, the better the chances of survival. Therefore, the correlation indicates that as survival increases, the class number decreases. This indicates that wealthier people (people in 1st class) were more likely to survive.
-Sex: ‘Sex’ has a moderate positive correlation (0.54) with ‘Survived’. As sex increases (man (1) to woman (2)), so does survival (death (0) to survival (1)). This indicates that women were more likely to survive.
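As a minimal sketch of the calculation behind these coefficients, assuming they are Pearson correlations on the numerically coded columns, the Python below applies the formula to only the six preview rows printed at the top of this report. It is an illustration, not the code that produced the report, and because it uses six rows its results will not match the full-data values quoted above. The function name `pearson` is my own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# The six preview rows only (not the full dataset).
survived = [0, 1, 1, 1, 0, 0]  # 0 = died, 1 = survived
pclass = [3, 1, 3, 1, 3, 1]    # 1 = first class, 3 = third class
sex = [1, 2, 2, 2, 1, 1]       # 1 = male, 2 = female

r_class = pearson(pclass, survived)
r_sex = pearson(sex, survived)
print(round(r_class, 2), round(r_sex, 2))
```

Even on this tiny sample the signs agree with the bullets: the class coefficient is negative and the sex coefficient is positive.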
Age Categories:
-Baby (0-2)
-Toddler (2-5)
-Child (5-13)
-Teen (13-20)
-Adult (20-40)
-MAA, Middle-Aged Adult (40-60)
-Senior (60+)
This plot shows that mostly Adults (ages 20-40) were on board. Middle-Aged Adults are the second most common age category. There appears to be an upward trend up to Adults and a downward trend after Adults. There are also more male passengers than female passengers in just about every age category.
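The binning into these categories could look roughly like the Python sketch below. It is illustrative only and not the code used for this report. The ranges as listed share their endpoints, so the sketch assumes that an age on a boundary goes to the older category (for example, age 2 counts as Toddler) and that a missing age stays missing. `AGE_BINS` and `age_category` are hypothetical names.

```python
# Lower bound (inclusive) and label for each category, oldest first.
# Assumption: a boundary age belongs to the older category.
AGE_BINS = [
    (60, "Senior"),
    (40, "MAA"),
    (20, "Adult"),
    (13, "Teen"),
    (5, "Child"),
    (2, "Toddler"),
    (0, "Baby"),
]

def age_category(age):
    """Return the category label for an age, or None when the age is missing."""
    if age is None:
        return None
    for lower, label in AGE_BINS:
        if age >= lower:
            return label
    return None  # negative ages are treated as invalid

# Ages taken from the preview rows at the top of the report, plus a missing value.
print([age_category(a) for a in (22, 38, 54, None)])
```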
This map shows the 3 ports where passengers boarded the Titanic. Southampton is the port with the largest number of passengers. A little over half of the passengers from Southampton were in third class, and the rest are split somewhat evenly between first and second class. The second largest port is Cherbourg, which had over half of its passengers in first class. The smallest port is Queenstown, which had 77 passengers, 72 of them in third class.
This plot shows the volume of passengers, as well as the fare price, for each class. Third class is the largest, first class is in the middle, and second class is the smallest. As expected, the price goes up as you get closer to first class.
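The port and class sizes described in the two paragraphs above can be re-derived from the port-by-class summary table printed later in this report. The Python below is a small sketch of that arithmetic, using the counts copied from that table. It is not the code behind the plots, and `totals_by` is a hypothetical helper name.

```python
# Passenger counts per (port, class), copied from the port-by-class
# summary table in this report.
counts = {
    ("Southampton", "First"): 127,
    ("Southampton", "Second"): 164,
    ("Southampton", "Third"): 353,
    ("Cherbourg", "First"): 85,
    ("Cherbourg", "Second"): 17,
    ("Cherbourg", "Third"): 66,
    ("Queenstown", "First"): 2,
    ("Queenstown", "Second"): 3,
    ("Queenstown", "Third"): 72,
}

def totals_by(index):
    """Sum the counts over one key position: 0 = port, 1 = class."""
    out = {}
    for key, n in counts.items():
        out[key[index]] = out.get(key[index], 0) + n
    return out

port_totals = totals_by(0)
class_totals = totals_by(1)

# Share of each port's passengers who travelled third class.
third_share = {
    port: counts[(port, "Third")] / total
    for port, total in port_totals.items()
}

print(port_totals)
print(class_totals)
print({port: round(share, 3) for port, share in third_share.items()})
```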
This plot shows the survival rate of passengers by age category and gender. It can be observed that females survived at a significantly higher rate than males. For women, it appears that the survival rate for babies and seniors was 100%, and all other categories seem to be similar to one another. For men, the chances of survival decrease as age increases.
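A survival rate per group is the number of survivors in the group divided by the group size. The Python sketch below shows that grouping on only the ten preview rows of the labeled passenger table printed later in this report, so its percentages are not the full-data rates shown in the plot. It is an illustration rather than the report's own code; `survival_rates` is a hypothetical name, and skipping rows with a missing age category is an assumption.

```python
from collections import defaultdict

# (outcome, sex, age category) for the ten preview rows printed in this
# report; None marks a missing age category.
rows = [
    ("Died", "Male", "Adult"),
    ("Survived", "Female", "Adult"),
    ("Survived", "Female", "Adult"),
    ("Survived", "Female", "Adult"),
    ("Died", "Male", "Adult"),
    ("Died", "Male", None),
    ("Died", "Male", "MAA"),
    ("Died", "Male", "Toddler"),
    ("Survived", "Female", "Adult"),
    ("Survived", "Female", "Teen"),
]

def survival_rates(records):
    """Percent survived per (age category, sex) group.

    Assumption: rows with a missing age category are skipped.
    """
    tally = defaultdict(lambda: [0, 0])  # group -> [survived, total]
    for outcome, sex, cat in records:
        if cat is None:
            continue
        tally[(cat, sex)][1] += 1
        if outcome == "Survived":
            tally[(cat, sex)][0] += 1
    return {group: 100 * s / n for group, (s, n) in tally.items()}

rates = survival_rates(rows)
for group, rate in sorted(rates.items()):
    print(group, rate)
```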
## # A tibble: 889 × 10
## # Groups: Embarked [3]
## Embarked Pclass Survived Total…¹ Embar…² lat long Embar…³ Embar…⁴ Embar…⁵
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cherbourg 1 0 168 93 49.6 -1.62 75 55.4 44.6
## 2 Cherbourg 1 1 168 93 49.6 -1.62 75 55.4 44.6
## 3 Cherbourg 1 1 168 93 49.6 -1.62 75 55.4 44.6
## 4 Cherbourg 3 0 168 93 49.6 -1.62 75 55.4 44.6
## 5 Cherbourg 1 1 168 93 49.6 -1.62 75 55.4 44.6
## 6 Cherbourg 1 1 168 93 49.6 -1.62 75 55.4 44.6
## 7 Cherbourg 3 1 168 93 49.6 -1.62 75 55.4 44.6
## 8 Cherbourg 2 0 168 93 49.6 -1.62 75 55.4 44.6
## 9 Cherbourg 3 0 168 93 49.6 -1.62 75 55.4 44.6
## 10 Cherbourg 1 1 168 93 49.6 -1.62 75 55.4 44.6
## # … with 879 more rows, and abbreviated variable names ¹TotalEmbarked,
## # ²EmbarkedSurvived, ³EmbarkedDied, ⁴EmbarkedSRate, ⁵EmbarkedDRate
## # A tibble: 9 × 6
## # Groups: Embarked, Pclass [9]
## Embarked Pclass TotalEmbarkedPclass TotalEmbPclassSur TotalEmbPclas…¹ Death
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Southampton Third 353 67 19.0 81.0
## 2 Cherbourg First 85 59 69.4 30.6
## 3 Southampton First 127 74 58.3 41.7
## 4 Queenstown Third 72 27 37.5 62.5
## 5 Cherbourg Second 17 9 52.9 47.1
## 6 Southampton Second 164 76 46.3 53.7
## 7 Cherbourg Third 66 25 37.9 62.1
## 8 Queenstown First 2 1 50 50
## 9 Queenstown Second 3 2 66.7 33.3
## # … with abbreviated variable name ¹TotalEmbPclassSurPerc
## # A tibble: 891 × 5
## Survived Pclass Sex Embarked cat
## <chr> <chr> <chr> <chr> <chr>
## 1 Died Third Male Southampton Adult
## 2 Survived First Female Cherbourg Adult
## 3 Survived Third Female Southampton Adult
## 4 Survived First Female Southampton Adult
## 5 Died Third Male Southampton Adult
## 6 Died Third Male Queenstown <NA>
## 7 Died First Male Southampton MAA
## 8 Died Third Male Southampton Toddler
## 9 Survived Third Female Southampton Adult
## 10 Survived Second Female Cherbourg Teen
## # … with 881 more rows
## # A tibble: 38 × 10
## # Groups: cat, Sex, Pclass [38]
## cat Sex Pclass CSCT S D SR DR cat1 group1
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Teen Female Second 8 8 0 100 0 Teen Female Second All S…
## 2 Toddler Female Second 4 4 0 100 0 Toddler Female Se… All S…
## 3 Child Female Second 4 4 0 100 0 Child Female Seco… All S…
## 4 Baby Male Second 5 5 0 100 0 Baby Male Second All S…
## 5 Teen Female First 13 13 0 100 0 Teen Female First All S…
## 6 Baby Female Third 4 4 0 100 0 Baby Female Third All S…
## 7 Toddler Male Second 3 3 0 100 0 Toddler Male Seco… All S…
## 8 Senior Female First 3 3 0 100 0 Senior Female Fir… All S…
## 9 Baby Male First 1 1 0 100 0 Baby Male First All S…
## 10 Toddler Male First 1 1 0 100 0 Toddler Male First All S…
## # … with 28 more rows